# Multimodal Large Language Model

## SAIL 7B
ByteDance-Seed · Apache-2.0 · Image-to-Text, Transformers

SAIL is a single-Transformer Multimodal Large Language Model (MLLM) designed for vision and language: raw pixel encoding and language decoding are integrated within one unified architecture.
## InternVL3 2B AWQ
OpenGVLab · Other license · Transformers

InternVL3-2B is an advanced Multimodal Large Language Model (MLLM) developed by OpenGVLab, with strong multimodal perception and reasoning capabilities. It supports tool use, GUI agents, industrial image analysis, 3D visual perception, and more.
## InternVL3 1B
FriendliAI · Other license · Transformers

InternVL3-1B is the 1B-parameter multimodal large language model in the InternVL3 series. It combines the InternViT vision encoder with the Qwen2.5 language model and offers strong multimodal perception and reasoning capabilities.
## Ovis2 1B Dev
Isotr0py · Apache-2.0 · Image-to-Text, Transformers, Multilingual

Ovis2-1B is the latest member of the Ovis series of multimodal large language models (MLLMs). It focuses on structural alignment of vision and text embeddings, and offers strong performance for its small size, enhanced reasoning, video and multi-image processing, and improved multilingual OCR.
## Video R1 7B
Video-R1 · Apache-2.0 · Video-to-Text, Transformers, English

Video-R1-7B is a multimodal large language model built on Qwen2.5-VL-7B-Instruct and optimized for video reasoning: it understands video content and answers questions about it.
## Finedefics
StevenHH2000 · Image-to-Text

Finedefics is an open-source multimodal large language model (MLLM) that improves fine-grained visual recognition (FGVR) by incorporating object attribute descriptions.
## VideoRefer 7B Stage2.5
DAMO-NLP-SG · Apache-2.0 · Video-to-Text, Transformers, English

VideoRefer-7B is a multimodal model built on a video large language model, focused on spatio-temporal object understanding.
## p-MoD LLaVA-NeXT 7B
MCG-NJU · Apache-2.0 · Image-to-Text

p-MoD is a Mixture-of-Depths multimodal large language model built with the progressive ratio decay method, supporting image-to-text generation tasks.
## Eagle X5 7B
NVEagle · Image-to-Text, Transformers

Eagle is a series of vision-centric, high-resolution multimodal large language models supporting input resolutions of 1K and above; it excels at tasks such as optical character recognition and document understanding.
## M3D LaMed Llama 2 7B
GoodBaiBai88 · Apache-2.0 · Image-to-Text, Transformers

M3D is a 3D medical image analysis framework based on multimodal large language models, comprising the M3D-Data dataset, the M3D-LaMed model, and the M3D-Bench evaluation benchmark.
© 2025 AIbase